Section 1 - Introduction to dplyr

Load the dplyr and hflights package

We will learn the ins and outs of working with dplyr. dplyr is an R package, a collection of functions and data sets that enhance the R language.

Throughout this course you will use dplyr to analyze a data set of airline flight data containing flights that departed from Houston. This data is stored in a package called hflights.

Both dplyr and hflights are already installed on DataCamp’s servers, so loading them with library() will get you up and running.

install.packages("hflights")
# Load the dplyr package
library(dplyr)

# Load the hflights package
library(hflights)

# Call both head() and summary() on hflights
hflights %>% 
  head(10) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
5424 2011 1 1 6 1400 1500 AA 428 N576AA 60 40 -10 0 IAH DFW 224 7 13 0 0
5425 2011 1 2 7 1401 1501 AA 428 N557AA 60 45 -9 1 IAH DFW 224 6 9 0 0
5426 2011 1 3 1 1352 1502 AA 428 N541AA 70 48 -8 -8 IAH DFW 224 5 17 0 0
5427 2011 1 4 2 1403 1513 AA 428 N403AA 70 39 3 3 IAH DFW 224 9 22 0 0
5428 2011 1 5 3 1405 1507 AA 428 N492AA 62 44 -3 5 IAH DFW 224 9 9 0 0
5429 2011 1 6 4 1359 1503 AA 428 N262AA 64 45 -7 -1 IAH DFW 224 6 13 0 0
5430 2011 1 7 5 1359 1509 AA 428 N493AA 70 43 -1 -1 IAH DFW 224 12 15 0 0
5431 2011 1 8 6 1355 1454 AA 428 N477AA 59 40 -16 -5 IAH DFW 224 7 12 0 0
5432 2011 1 9 7 1443 1554 AA 428 N476AA 71 41 44 43 IAH DFW 224 8 22 0 0
5433 2011 1 10 1 1443 1553 AA 428 N504AA 70 45 43 43 IAH DFW 224 6 19 0 0
sum_hfl <- as.data.frame(do.call(cbind, lapply(hflights, summary)))
## Warning in (function (..., deparse.level = 1) : number of rows of result is
## not a multiple of vector length (arg 1)
sum_hfl[,-c(1,2,3)] %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted
Min. 1 1 1 227496 1 227496 34 11 -70 -33 227496 227496 79 1 1 0 227496 0
1st Qu. 2 1021 1215 character 855 character 77 58 -8 -3 character character 376 4 10 0 character 0
Median 4 1416 1617 character 1696 character 128 107 0 0 character character 809 5 14 0 character 0
Mean 3.94769138798045 1395.7556135375 1578.25410595731 227496 1961.66314132996 227496 129.323744606341 108.142334527457 7.09433431305109 9.44495104434283 227496 227496 787.783244540563 6.098854876799 15.0910981567498 0.0130683616415234 227496 0.00285279741182263
3rd Qu. 6 1801 1953 character 2755 character 165 141 11 9 character character 1042 7 18 0 character 0
Max. 7 2400 2400 character 7290 character 575 549 978 981 character character 3904 165 163 1 character 1
NA’s 1 2905 3066 227496 1 227496 3622 3622 3622 2905 227496 227496 79 3066 2947 0 227496 0

Explore the data set

A data scientist must be familiar with his or her data. Experiment with the data set in the console and maybe try to generate some insightful plots. For your convenience, hflights is already loaded into the workspace.

How many observations and how many variables are contained in the hflights data set?

dim(hflights)
## [1] 227496     21

The variables are stored in the columns, whereas the observations are stored in the rows of this data set. As you can see, this data set is relatively large. In fact, it’s about 25Mb in size. For your convenience, we will continue the course’s exercises with a random subset of the data set.

Section 2 - tbl, a special type of data.frame

Convert data.frame to tibble

A tbl (pronounced tibble) is just a special kind of data.frame. They make your data easier to look at, but also easier to work with. On top of this, it is straightforward to derive a tbl from a data.frame structure using as_tibble().

The tbl format changes how R displays your data, but it does not change the data’s underlying data structure. A tbl inherits the original class of its input, in this case, a data.frame. This means that you can still manipulate the tbl as if it were a data.frame. In other words, you can do anything with the hflights tbl that you could do with the hflights data.frame.

# Both the dplyr and hflights packages are loaded

# Convert the hflights_df data.frame into a hflights tbl
hflights_df <- hflights[sample(nrow(hflights), 720), ] 
hflights <- as_tibble(hflights_df)

# Display the hflights tbl
hflights
## # A tibble: 720 x 21
##     Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
##    <int> <int>      <int>     <int>   <int>   <int> <chr>             <int>
##  1  2011     3         11         5    1544    1741 OO                 1170
##  2  2011     1         17         1    1252    1622 CO                 1850
##  3  2011     2          4         5    1503    1554 WN                 2477
##  4  2011    10          8         6    2058    2202 CO                 1019
##  5  2011     5         10         2    1525    1901 US                 1978
##  6  2011     2         25         5    1759    1858 XE                 2656
##  7  2011     7         11         1    1910    2012 XE                 2911
##  8  2011     4          2         6    1907    2007 CO                  479
##  9  2011    11          4         5    1635    1812 XE                 4277
## 10  2011     1         28         5     715     847 XE                 2606
## # ... with 710 more rows, and 13 more variables: TailNum <chr>,
## #   ActualElapsedTime <int>, AirTime <int>, ArrDelay <int>,
## #   DepDelay <int>, Origin <chr>, Dest <chr>, Distance <int>,
## #   TaxiIn <int>, TaxiOut <int>, Cancelled <int>, CancellationCode <chr>,
## #   Diverted <int>
# Create the object carriers
hflights$carriers <- hflights$UniqueCarrier

Changing labels of hflights, part 1 of 2

A bit of cleaning would be a good idea since the UniqueCarrier variable of hflights uses a confusing code system.

To do this, let’s work with a lookup table, that comes in the form of a named vector. When you subset the lookup table with a character string (like the character strings in UniqueCarrier), R will return the values of the lookup table that correspond to the names in the character string. To see how this works, run following code in the console:

two <- c("AA", "AS")
lut <- c("AA" = "American", 
         "AS" = "Alaska", 
         "B6" = "JetBlue")
two <- lut[two]
two
##         AA         AS 
## "American"   "Alaska"
# Both the dplyr and hflights packages are loaded into workspace
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

# Add the Carrier column to hflights
hflights$Carrier <- lut[hflights$UniqueCarrier]

# Glimpse at hflights
glimpse(hflights)
## Observations: 720
## Variables: 23
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 3, 1, 2, 10, 5, 2, 7, 4, 11, 1, 4, 7, 6, 2, ...
## $ DayofMonth        <int> 11, 17, 4, 8, 10, 25, 11, 2, 4, 28, 28, 23, ...
## $ DayOfWeek         <int> 5, 1, 5, 6, 2, 5, 1, 6, 5, 5, 4, 6, 4, 1, 4,...
## $ DepTime           <int> 1544, 1252, 1503, 2058, 1525, 1759, 1910, 19...
## $ ArrTime           <int> 1741, 1622, 1554, 2202, 1901, 1858, 2012, 20...
## $ UniqueCarrier     <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ FlightNum         <int> 1170, 1850, 2477, 1019, 1978, 2656, 2911, 47...
## $ TailNum           <chr> "N742SK", "N32626", "N252WN", "N37409", "N53...
## $ ActualElapsedTime <int> 117, 150, 51, 124, 156, 119, 62, 60, 97, 92,...
## $ AirTime           <int> 94, 128, 38, 108, 129, 97, 41, 40, 77, 72, 5...
## $ ArrDelay          <int> 15, -10, 44, -18, 3, -10, -10, 1, 10, -6, -6...
## $ DepDelay          <int> 14, -3, 48, -2, -5, -6, -5, -3, 10, -5, -2, ...
## $ Origin            <chr> "IAH", "IAH", "HOU", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "MCI", "RDU", "SAT", "DEN", "CLT", "ELP", "M...
## $ Distance          <int> 643, 1043, 192, 862, 913, 667, 262, 191, 562...
## $ TaxiIn            <int> 5, 3, 4, 7, 10, 3, 3, 2, 10, 8, 5, 7, 6, 2, ...
## $ TaxiOut           <int> 18, 19, 9, 9, 17, 19, 18, 18, 10, 12, 11, 6,...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ carriers          <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ Carrier           <chr> "SkyWest", "Continental", "Southwest", "Cont...

Changing labels of hflights, part 2 of 2

Let’s try a similar thing, but this time to change the labels in the CancellationCode column. This column lists reasons why a flight was cancelled using a non-informative alphabetical code. Execute

unique(hflights$CancellationCode)

A lookup table lut has already been created for you, that converts the alphabetical codes into more meaningful strings.

# The hflights tbl you built in the previous exercise is available in the workspace.

# The lookup table
lut <- c("A" = "carrier", "B" = "weather", "C" = "FFA", "D" = "security", "E" = "not cancelled")

# Add the Code column
hflights$Code <- lut[hflights$CancellationCode]

# Glimpse at hflights
glimpse(hflights)
## Observations: 720
## Variables: 24
## $ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
## $ Month             <int> 3, 1, 2, 10, 5, 2, 7, 4, 11, 1, 4, 7, 6, 2, ...
## $ DayofMonth        <int> 11, 17, 4, 8, 10, 25, 11, 2, 4, 28, 28, 23, ...
## $ DayOfWeek         <int> 5, 1, 5, 6, 2, 5, 1, 6, 5, 5, 4, 6, 4, 1, 4,...
## $ DepTime           <int> 1544, 1252, 1503, 2058, 1525, 1759, 1910, 19...
## $ ArrTime           <int> 1741, 1622, 1554, 2202, 1901, 1858, 2012, 20...
## $ UniqueCarrier     <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ FlightNum         <int> 1170, 1850, 2477, 1019, 1978, 2656, 2911, 47...
## $ TailNum           <chr> "N742SK", "N32626", "N252WN", "N37409", "N53...
## $ ActualElapsedTime <int> 117, 150, 51, 124, 156, 119, 62, 60, 97, 92,...
## $ AirTime           <int> 94, 128, 38, 108, 129, 97, 41, 40, 77, 72, 5...
## $ ArrDelay          <int> 15, -10, 44, -18, 3, -10, -10, 1, 10, -6, -6...
## $ DepDelay          <int> 14, -3, 48, -2, -5, -6, -5, -3, 10, -5, -2, ...
## $ Origin            <chr> "IAH", "IAH", "HOU", "IAH", "IAH", "IAH", "I...
## $ Dest              <chr> "MCI", "RDU", "SAT", "DEN", "CLT", "ELP", "M...
## $ Distance          <int> 643, 1043, 192, 862, 913, 667, 262, 191, 562...
## $ TaxiIn            <int> 5, 3, 4, 7, 10, 3, 3, 2, 10, 8, 5, 7, 6, 2, ...
## $ TaxiOut           <int> 18, 19, 9, 9, 17, 19, 18, 18, 10, 12, 11, 6,...
## $ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
## $ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ carriers          <chr> "OO", "CO", "WN", "CO", "US", "XE", "XE", "C...
## $ Carrier           <chr> "SkyWest", "Continental", "Southwest", "Cont...
## $ Code              <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hflights_0.1     ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0 
## [5] kableExtra_1.0.1 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     tibble_2.0.1      viridisLite_0.3.0
## [13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1        
## [16] rstudioapi_0.9.0  yaml_2.2.0        xfun_0.4         
## [19] withr_2.1.2       httr_1.4.0        stringr_1.4.0    
## [22] xml2_1.2.0        hms_0.4.2         webshot_0.5.1    
## [25] grid_3.5.2        tidyselect_0.2.5  glue_1.3.0       
## [28] R6_2.4.0          fansi_0.4.0       rmarkdown_1.11   
## [31] readr_1.3.1       purrr_0.3.0       magrittr_1.5     
## [34] scales_1.0.0      htmltools_0.3.6   assertthat_0.2.0 
## [37] rvest_0.3.2       colorspace_1.4-0  utf8_1.1.4       
## [40] stringi_1.3.1     lazyeval_0.2.1    munsell_0.5.0    
## [43] crayon_1.3.4